(α, k)-Minimal Sorting and Skew Join in MPI and MapReduce

Authors

  • Silu Huang
  • Ada Wai-Chee Fu
Abstract

As computer clusters are found to be highly effective for handling massive datasets, the design of efficient parallel algorithms for such a computing model is of great interest. We consider (α, k)-minimal algorithms for such a purpose, where α is the number of rounds in the algorithm, and k is a bound on the deviation from perfect workload balance. We focus on new (α, k)-minimal algorithms for sorting and skew equi-join operations for computer clusters. To the best of our knowledge the proposed sorting and skew join algorithms achieve the best workload balancing guarantee when compared to previous works. Our empirical study shows that they are close to optimal in workload balancing. In particular, our proposed sorting algorithm is around 25% more efficient than the state-of-the-art Terasort algorithm and achieves significantly more even workload distribution by over 50%.
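To make the balance parameter concrete, here is a minimal single-process sketch of a generic sample-based range-partitioning sort, followed by a measurement of how far the resulting partition sizes deviate from a perfectly even split. This is not the algorithm proposed in the paper. The function names (`range_partition_sort`, `imbalance`), the sampling strategy, and the definition of the imbalance ratio as the largest partition size divided by n/p are all assumptions made for illustration; the abstract does not state the exact definition of k.

```python
import bisect
import random


def range_partition_sort(data, p, sample_size, seed=0):
    """Sort `data` by splitting it into p ranges chosen from a random sample.

    Hypothetical illustration only: each of the p buckets stands in for
    the workload of one machine in a cluster. Assumes `data` is non-empty.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Pick p - 1 splitters at evenly spaced quantiles of the sample.
    splitters = [sample[(i * len(sample)) // p] for i in range(1, p)]
    buckets = [[] for _ in range(p)]
    for x in data:
        # bisect_right returns an index in 0..p-1, i.e. the target bucket.
        buckets[bisect.bisect_right(splitters, x)].append(x)
    # Each "machine" sorts its own bucket locally.
    return [sorted(b) for b in buckets]


def imbalance(buckets):
    """Largest bucket size divided by the ideal size n / p (assumed metric)."""
    n = sum(len(b) for b in buckets)
    ideal = n / len(buckets)
    return max(len(b) for b in buckets) / ideal


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.randint(0, 10**6) for _ in range(10_000)]
    buckets = range_partition_sort(data, p=8, sample_size=800)
    print("bucket sizes:", [len(b) for b in buckets])
    print("imbalance ratio:", imbalance(buckets))
```

Under this assumed metric, a ratio of 1.0 means a perfectly even split, and larger values mean the most loaded partition holds proportionally more than its fair share. Concatenating the buckets in order yields the fully sorted output.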

Similar resources

SharesSkew: An Algorithm to Handle Skew for Joins in MapReduce

In this paper, we investigate the problem of computing a multiway join in one round of MapReduce when the data may be skewed. We optimize on communication cost, i.e., the amount of data that is transferred from the mappers to the reducers. We identify join attributes values that appear very frequently, Heavy Hitters (HH). We distribute HH valued records to reducers avoiding skew by using an ada...

Handling Skew in Multiway Joins in Parallel Processing

Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In...

Efficient Large Outer Joins over MapReduce

Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...

Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

The Map/Reduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators like join, Cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce can process homogeneous ...

A Scalable and Skew-insensitive Algorithm for Join Operations using Map/Reduce Model

For over a decade, Map/Reduce has been a prominent programming model for handling vast amounts of raw data in large-scale systems. This model ensures scalability, reliability, and availability with reasonable query processing time. However, these large-scale systems still face some challenges: data skew, task imbalance, high disk I/O, and redistribution costs can have disastrous effects on...

Journal:
  • CoRR

Volume: abs/1403.5381

Publication date: 2014